2025-01-06
Importance of human judgment and contextual knowledge
High-quality data compared with automated methods such as dictionaries
Highly time-consuming, labor-intensive, and costly
Often relies on a small sample of texts, which can be biased
| Machine Learning Lingo | Statistics Lingo |
|---|---|
| Feature | Independent variable |
| Label | Dependent variable |
| Labeled dataset | Dataset with both independent and dependent variables |
| To train a model | To estimate |
| Classifier (classification) | Model to predict nominal outcomes |
| To annotate | To (manually) code (content analysis) |
Licht et al. (2024): A supervised learning workflow
Do, Ollion, and Shen (2024): Policy vs. politics classification task

Split the annotated sample into a training set and a test set (70/30 rule)
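The split above can be sketched with scikit-learn; `texts` and `labels` here are hypothetical placeholders for an annotated sample, not any of the cited datasets.

```python
# Sketch of a 70/30 train/test split with scikit-learn.
from sklearn.model_selection import train_test_split

# Toy annotated sample (placeholder data)
texts = ["budget debate", "party leadership", "tax reform", "coalition talks"] * 25
labels = ["policy", "politics", "policy", "politics"] * 25

# stratify keeps the label proportions identical in the two splits
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)
print(len(X_train), len(X_test))  # 70 30
```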
A supervised learning model is a model that learns a mapping function from input features to output labels based on a training set
The model “learns” the underlying relationship between the input features and the labels by minimizing a loss function that measures the difference between the predicted labels and the true labels
During the training phase, the model adjusts its parameters to minimize the loss function and make better predictions
Models differ in the way they learn this mapping function
The function is then used to predict the labels of the test set
It returns the probability that a text belongs to each category
Use the test set to evaluate the performance of the model
The model is used to predict the categories of the texts in the test set
The predictions are compared to the true categories with different metrics
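One common way to implement this workflow is a TF-IDF representation fed into a logistic regression, shown here as a minimal scikit-learn sketch with toy texts and labels (placeholders, not the cited data):

```python
# Sketch: TF-IDF features + logistic regression classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["tax reform bill", "budget and spending",
               "party leadership race", "coalition negotiations"]
train_labels = ["policy", "policy", "politics", "politics"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)  # training minimizes the log-loss

# Predicted probability of each category for an unseen text
proba = model.predict_proba(["new budget proposal"])
print(dict(zip(model.classes_, proba[0].round(2))))
```

The same fitted pipeline is then applied to the held-out test set and its predictions compared with the true labels.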
Accuracy: proportion of correctly classified texts (highly limited for imbalanced datasets)
\[ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \]
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
\[ \text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \]
\[ \text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \]
\[ \text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
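These metrics are available in scikit-learn; a small worked example with toy true and predicted labels (one false negative, one false positive) confirms the formulas:

```python
# Sketch: accuracy, precision, recall, and F1 on toy predictions.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]  # TP=3, FN=1, FP=1, TN=3

print(accuracy_score(y_true, y_pred))   # 6/8 = 0.75
print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3/4
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3/4
print(f1_score(y_true, y_pred))         # 2 * (0.75 * 0.75) / 1.5 = 0.75
```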
| Problem | Solution |
|---|---|
| Unbalanced classes | Undersampling & oversampling |
| Not enough training data | More annotation |
| Low-quality training data | Better annotation |
| Low-quality text features | Better preprocessing |
| Limited text representation | Move to more complex models |
| Concept too complex | Accept merely adequate performance |
Do, Ollion, and Shen (2024)
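Besides under- and oversampling, an alternative for unbalanced classes is to reweight the loss; a minimal sketch with scikit-learn's `class_weight="balanced"` option, which weights each class by its inverse frequency (toy placeholder data):

```python
# Sketch: handling unbalanced classes via class weights instead of resampling.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy imbalanced sample: the positive class is rare
texts = ["anti-elite rhetoric"] * 2 + ["routine policy statement"] * 18
labels = [1] * 2 + [0] * 18

model = make_pipeline(
    TfidfVectorizer(),
    # "balanced" reweights the loss by inverse class frequency,
    # so the 2 rare positives count as much as the 18 negatives
    LogisticRegression(class_weight="balanced"),
)
model.fit(texts, labels)
```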
Peterson and Spirling (2018): accuracy as a measure of polarization
Licht et al. (2024): predicting the use of anti-elite strategies
Sattelmayer (forthcoming): the effect of party positions on immigration on vote switching to the far right
Supervised text classification